CLDR-17897 Fix unstable scripts when running GenerateLikelySubtags and ConvertLanguageData #3998

conradarcturus
Contributor

CLDR-17897

While we improve the population data and likely subtags, partial data is producing side effects in the generated files. This PR adds new scripts so that future changes avoid these side effects. Ultimately we will want to reduce the number of overrides here, but it's good to fix this now.

See the data updated in this diagram:
[Image: Screenshot 2024-08-29 at 15 29 41]

  • This PR completes the ticket. -- I'm submitting this request first to separate the changes
  • This PR stabilizes the data so it's easier to follow up and fix other overrides.

Run this command to regenerate data: mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

ALLOW_MANY_COMMITS=true

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

… and likely subtag overrides

The generated files for ConvertLanguageData and GenerateLikelySubtags change if input files are modified. This change seeks to stabilize the scripts' output.

CLDR-17897 Add overrides to Likely Subtags
@conradarcturus conradarcturus force-pushed the CLDR-17884-Add-primary-scripts branch from 3c0661a to e5fa96d Compare August 29, 2024 22:42
@conradarcturus conradarcturus force-pushed the CLDR-17884-Add-primary-scripts branch from da92bfe to 0756612 Compare August 29, 2024 23:24

@@ -703,10 +706,10 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="tiv" to="tiv_Latn_NG"/> <!--Tiv‧?‧? ➡ Tiv‧Latin‧Nigeria-->
 <likelySubtag from="tk" to="tk_Latn_TM"/> <!--Turkmen‧?‧? ➡ Turkmen‧Latin‧Turkmenistan-->
 <likelySubtag from="tkl" to="tkl_Latn_TK"/> <!--Tokelau‧?‧? ➡ Tokelau‧Latin‧Tokelau-->
-<likelySubtag from="tkr" to="tkr_Latn_AZ"/> <!--Tsakhur‧?‧? ➡ Tsakhur‧Latin‧Azerbaijan-->
+<likelySubtag from="tkr" to="tkr_Cyrl_AZ"/> <!--Tsakhur‧?‧? ➡ Tsakhur‧Cyrillic‧Azerbaijan-->
Contributor

This doesn't make sense. Tsakhur is written in Latin in Azerbaijan and in Cyrillic in Russia. The old value was correct.

Contributor Author

Thanks for the close examination -- I'll re-introduce the overrides for these languages. I'm having a problem fighting the different sources of truth :p Definitely Latn should be considered the primary script in Azerbaijan.

I think the root problem is that "Cyrl" sorts before "Latn" alphabetically, so when the script is re-run it now takes the first item alphabetically.
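
To make the suspected cause concrete, here is a minimal sketch of how sorting a script list before taking its first element changes the result. The class and method names are hypothetical and are not taken from GenerateLikelySubtags; the input list simply mirrors the Tsakhur case discussed above.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not the CLDR tool code.
final class FirstScriptSketch {
    // Picks a "default" script by sorting alphabetically and taking the first entry.
    static String firstAlphabetical(List<String> scripts) {
        return new TreeSet<>(scripts).first();
    }

    // Picks a "default" script by trusting the order in which the data was supplied.
    static String firstRanked(List<String> scripts) {
        return scripts.get(0);
    }

    public static void main(String[] args) {
        // Assumed ranked input for Tsakhur in Azerbaijan: Latin first.
        List<String> scripts = List.of("Latn", "Cyrl");
        System.out.println(firstAlphabetical(scripts)); // Cyrl
        System.out.println(firstRanked(scripts));       // Latn
    }
}
```

If the generator behaves like `firstAlphabetical`, any language whose primary script sorts after another listed script would need an explicit override.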

@@ -725,7 +728,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="tt" to="tt_Cyrl_RU"/> <!--Tatar‧?‧? ➡ Tatar‧Cyrillic‧Russia-->
 <likelySubtag from="ttj" to="ttj_Latn_UG"/> <!--Tooro‧?‧? ➡ Tooro‧Latin‧Uganda-->
 <likelySubtag from="tts" to="tts_Thai_TH"/> <!--Northeastern Thai‧?‧? ➡ Northeastern Thai‧Thai‧Thailand-->
-<likelySubtag from="ttt" to="ttt_Latn_AZ"/> <!--Muslim Tat‧?‧? ➡ Muslim Tat‧Latin‧Azerbaijan-->
+<likelySubtag from="ttt" to="ttt_Cyrl_AZ"/> <!--Muslim Tat‧?‧? ➡ Muslim Tat‧Cyrillic‧Azerbaijan-->
Contributor

Same problem. Muslim Tat is written in Latin in Azerbaijan and in Cyrillic in Russia.

@@ -1036,6 +1039,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="und_Ahom" to="aho_Ahom_IN"/> <!--?‧Ahom‧? ➡ Ahom‧Ahom‧India-->
 <likelySubtag from="und_Arab" to="ar_Arab_EG"/> <!--?‧Arabic‧? ➡ Arabic‧Arabic‧Egypt-->
 <likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/> <!--?‧Arabic‧Afghanistan ➡ Persian‧Arabic‧Afghanistan-->
+<likelySubtag from="und_Arab_AZ" to="tly_Arab_AZ"/> <!--?‧Arabic‧Azerbaijan ➡ Talysh‧Arabic‧Azerbaijan-->
Contributor

If you have something unknown in Arabic script in Azerbaijan, it's probably not Talysh (which has a pretty small community compared to Azerbaijani, where they write in Latin). It's very probably Azerbaijani in the old orthography.

@@ -1890,7 +1890,7 @@ XXX Code for transations where no currency is involved
 <language type="lv" scripts="Latn" territories="LV"/>
 <language type="lwl" scripts="Thai"/>
 <language type="lzh" scripts="Hans" alt="secondary"/>
-<language type="lzz" scripts="Latn Geor"/>
+<language type="lzz" scripts="Geor Latn"/>
Member

Is the ordering significant?

Contributor Author

The code writes the scripts alphabetically, so when I regenerate the file it force-alphabetizes them. There is an argument that they should be ordered by usage. However, the XML is not a good way to capture this because the labelling is unclear.

Member

The scripts (and regions) should be in ranked order, not sorted. If the code is sorting them, that's a bug.
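
As a rough sketch of what "ranked, not sorted" could look like, the snippet below orders scripts by descending usage and only falls back to alphabetical order to break ties, which keeps the output stable between runs. The class name, method name, and usage figures are all hypothetical placeholders, not CLDR data or CLDR code.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is not the ConvertLanguageData implementation.
final class RankedScriptsSketch {
    // Joins script codes with spaces, highest usage first.
    // Ties are broken alphabetically so repeated runs produce identical output.
    static String rankedScripts(Map<String, Long> usageByScript) {
        return usageByScript.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .map(Map.Entry::getKey)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Placeholder usage counts, invented for the example.
        System.out.println(rankedScripts(Map.of("Geor", 1000L, "Latn", 9000L))); // Latn Geor
        System.out.println(rankedScripts(Map.of("Latn", 5L, "Cyrl", 5L)));       // Cyrl Latn
    }
}
```

The tie-breaker matters for this PR's goal: without it, equal-weight scripts could come out in a different order on each regeneration.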

Member

It looks like if Roozbeh's items are taken care of, then this would be ready to merge into 47.

Contributor Author

I'm glad Roozbeh took a look ;) I resolved this by changing the non-Latn script for these languages to be considered "secondary" in language_script.tsv

Merging these changes ended up getting really messy, so I'll post a new pull request.

Comment on lines +647 to +648
pnt Pontic secondary Cyrl Cyrillic
pnt Pontic secondary Latn Latin
Member

What's the basis of making these secondary?

Contributor Author

Promoting Grek to be the primary script for Pontic.

Really, for all current Pontic speakers it's Grek in Greece, Latn in Turkey, and Cyrl in Russia/Ukraine. Pontic is only spoken by very marginal populations in Turkey and Russia, but it's a large recognized community in Greece.

What's the basis for primary v secondary?

Member

The primary vs secondary should be based on the literate population sizes. I forget what the cutoff is, but clearly if >50% of the usage of the language is in a particular script, that would be primary, not secondary. (But again, there might be a bug in the code.)
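
A share-based rule of this kind might be sketched as follows. The 50% cutoff is an assumption taken from the ">50%" remark above, since the actual cutoff is not stated in this thread; the class name and the population figures are hypothetical placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; the real cutoff and data source are not given in this thread.
final class PrimaryScriptSketch {
    // Assumed cutoff: a script is "primary" when it covers more than half of literate usage.
    static final double ASSUMED_CUTOFF = 0.5;

    // Labels each script "primary" or "secondary" by its share of the total literate population.
    static Map<String, String> classify(Map<String, Long> literateByScript) {
        long total = literateByScript.values().stream().mapToLong(Long::longValue).sum();
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : literateByScript.entrySet()) {
            boolean primary = total > 0 && (double) entry.getValue() / total > ASSUMED_CUTOFF;
            result.put(entry.getKey(), primary ? "primary" : "secondary");
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder figures for Pontic, invented for the example.
        Map<String, Long> pontic = new LinkedHashMap<>();
        pontic.put("Grek", 900L);
        pontic.put("Latn", 50L);
        pontic.put("Cyrl", 50L);
        System.out.println(classify(pontic)); // {Grek=primary, Latn=secondary, Cyrl=secondary}
    }
}
```

With a strict majority cutoff, a language split evenly across scripts would have no primary script at all, so a real rule would likely need a lower threshold or a fallback.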

@@ -392,6 +393,9 @@ public static void main(String[] args) throws IOException {
{"mro", "mro_Mroo_BD"},
{"mro_BD", "mro_Mroo_BD"},
{"ms_Arab", "ms_Arab_MY"},
{"nan", "nan_Hans_CN"},
{"nan_Hans", "nan_Hans_CN"},
Member

This won't hurt anything, but nan_Hans is redundant, because the algorithm will find {"nan", "nan_Hans_CN"}, and fill in.

There is a ticket open for dropping overrides that have no effect, so it is ok to keep this line for now.
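
The lookup-time fill-in described here can be sketched in a heavily simplified form. The snippet handles only `lang` and `lang_Script` inputs and assumes every table value is a full `lang_Script_REGION` tag; the class and method names are hypothetical and this is not the full likely-subtags algorithm.

```java
import java.util.Map;

// Hypothetical, simplified illustration of filling in missing subtags at lookup time.
final class FillInSketch {
    // Returns the exact table entry if present; otherwise, for a lang_Script input,
    // keeps the given script and borrows the region from the language-only entry.
    static String maximize(String tag, Map<String, String> likely) {
        String exact = likely.get(tag);
        if (exact != null) {
            return exact;
        }
        String[] given = tag.split("_");
        String byLanguage = likely.get(given[0]);
        if (byLanguage == null || given.length != 2) {
            return tag; // nothing to fill in from, or a form this sketch does not handle
        }
        String region = byLanguage.split("_")[2];
        return given[0] + "_" + given[1] + "_" + region;
    }

    public static void main(String[] args) {
        Map<String, String> withoutOverride = Map.of("nan", "nan_Hans_CN");
        Map<String, String> withOverride =
                Map.of("nan", "nan_Hans_CN", "nan_Hans", "nan_Hans_CN");
        System.out.println(maximize("nan_Hans", withoutOverride)); // nan_Hans_CN
        System.out.println(maximize("nan_Hans", withOverride));    // nan_Hans_CN
    }
}
```

Both tables give the same answer at lookup time, which is the sense in which the second entry is redundant. This says nothing about what the generator itself emits when the override is absent.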

Contributor Author

Interestingly, I need to keep this line; otherwise the produced likelySubtags.xml file will show nan_Hans -> nan_Hans_TW, even though, as we know, the Hant script would be preferred in Taiwan. The problem is that we don't have population estimates on Simplified v Traditional Chinese script usage.

@conradarcturus
Contributor Author

Thanks everyone for the comments! It helped me make a better version of this PR in #4015.

Apologies for making a separate one -- rebasing it to the ddl/v47 branch introduced weird merge artifacts so I just made a new PR.

@conradarcturus conradarcturus deleted the CLDR-17884-Add-primary-scripts branch September 4, 2024 04:19
4 participants